perf: improve write_fmt to handle simple strings #121001
Can you run some benchmarks for our original motivating program to see whether this helps?
@bors try
perf: improve write_fmt to handle simple strings

Per `@dtolnay` suggestion in serde-rs/serde#2697 (comment) - attempt to speed up performance in the cases of a simple string format without arguments:

```rust
write!(f, "text") -> f.write_str("text")
```

```diff
+ #[inline]
  pub fn write_fmt(&mut self, f: fmt::Arguments) -> fmt::Result {
+     if let Some(s) = f.as_str() {
+         self.buf.write_str(s)
+     } else {
          write(self.buf, f)
+     }
  }
```

Hopefully it will improve the simple case for the rust-lang#99012

CC: `@m-ou-se` as probably the biggest expert in everything `format!`
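The fast path in the diff above can be tried outside the standard library. The following is only a minimal sketch: `write_fast` is a hypothetical free function (not part of `core`), and it assumes nothing beyond the stable `fmt::Arguments::as_str` and `fmt::write` APIs.

```rust
use std::fmt::{self, Write};

/// Hypothetical helper that mirrors the proposed fast path:
/// a format string without arguments is forwarded to `write_str`,
/// anything else goes through the general `fmt::write` machinery.
fn write_fast<W: Write>(out: &mut W, args: fmt::Arguments<'_>) -> fmt::Result {
    if let Some(s) = args.as_str() {
        // No runtime arguments: the whole output is one static string.
        out.write_str(s)
    } else {
        // General case: let the formatting machinery handle the arguments.
        fmt::write(out, args)
    }
}

fn main() {
    let mut out = String::new();
    // Literal-only format string, so `as_str()` returns `Some("text")`.
    write_fast(&mut out, format_args!("text")).unwrap();
    // A runtime value is interpolated here, so this is the general case.
    let n = out.len();
    write_fast(&mut out, format_args!(" has {n} bytes")).unwrap();
    println!("{out}");
}
```

Both branches produce the same bytes; the difference is only in how much work is done to get there.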
I tried this code, running it with the latest main branch and compiling using

```rust
#[no_mangle]
#[inline(never)]
pub fn fmt_abcd(f: &mut Formatter<'_>) -> fmt::Result {
    write!(f, "ABCXYZ")
}
```

Before the change:

```asm
fmt_abcd:
	.cfi_startproc
	subq $56, %rsp
	.cfi_def_cfa_offset 64
	leaq .L__unnamed_2(%rip), %rax
	movq %rax, 8(%rsp)
	movq $1, 16(%rsp)
	leaq .L__unnamed_3(%rip), %rax
	movq %rax, 24(%rsp)
	xorps %xmm0, %xmm0
	movups %xmm0, 32(%rsp)
	leaq 8(%rsp), %rsi
	callq *_ZN4core3fmt9Formatter9write_fmt17hf6272f62fa2a64dfE@GOTPCREL(%rip)
	addq $56, %rsp
	.cfi_def_cfa_offset 8
	retq
.Lfunc_end5:
```

After the change:

```asm
fmt_abcd:
	.cfi_startproc
	movq 32(%rdi), %rax
	movq 40(%rdi), %rcx
	movq 24(%rcx), %rcx
	leaq .L__unnamed_2(%rip), %rsi
	movl $6, %edx
	movq %rax, %rdi
	jmpq *%rcx
.Lfunc_end5:
```
In the past, this didn't really help (#100700), but we didn't have runtime benchmarks then.
☀️ Try build successful - checks-actions
Finished benchmarking commit (09bfbad): comparison URL.
Overall result: ✅ improvements - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 662.665s -> 665.299s (0.40%)
@Kobzol thx, I tried to dig through the graphs - but there is so much data, I am not certain what to look at. The worst test I see is branch misses - it went up by 23773.11% (!!) for What are the next steps for this? Is there any reliable way to figure out what's going on with a certain test? Note that micro benchmarks show nearly 80% perf gain when using
I would take the runtime benchmarks results with a grain of salt. That being said, a lot of branch misses makes sense. The call is resolved with dynamic dispatch, through a v-table, and this condition might be very badly predicted, because it will be true for some values and false for others, quite randomly.
my understanding was that this new
Sorry, my point about the v-table was wrong, there's no dynamic dispatch, the function is implemented on On the other hand, the
I wonder if we can use
write! with a single string argument is not properly optimized and using write_str generates better code: serde-rs/serde#2697 rust-lang/rust#121001
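As an illustration of the pattern that note refers to, here is a small sketch of two `Display` impls that emit the same fixed string, one through `write_str` and one through `write!`. The type names `Missing` and `MissingViaMacro` are hypothetical and exist only for this comparison.

```rust
use std::fmt;

/// Hypothetical unit type whose `Display` output is a fixed string.
struct Missing;

impl fmt::Display for Missing {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Direct call: no `fmt::Arguments` value has to be built.
        f.write_str("value is missing")
    }
}

/// Same output, written through the `write!` macro for comparison.
struct MissingViaMacro;

impl fmt::Display for MissingViaMacro {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Goes through `Formatter::write_fmt`, the method this PR changes.
        write!(f, "value is missing")
    }
}

fn main() {
    // The observable output is identical; only the generated code differs.
    assert_eq!(Missing.to_string(), MissingViaMacro.to_string());
    println!("{}", Missing);
}
```

The two impls are interchangeable from the caller's point of view, which is why libraries can switch to `write_str` without waiting for a compiler-side fix.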
Finished benchmarking commit (dc99c35): comparison URL.
Overall result: ❌ regressions - no action needed
Benchmarking this pull request likely means that it is perf-sensitive, so we're automatically marking it as not fit for rolling up. While you can manually mark this PR as fit for rollup, we strongly recommend not doing so since this PR may lead to changes in compiler perf.
@bors rollup=never
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This benchmark run did not return any relevant results for this metric.
Binary size: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 640.589s -> 639.335s (-0.20%)
Are these good results or not? Or are we not even doing relevant perf testing for the simple
I don't think that our current benchmarks in rustc-perf are a good fit for a change like this. Perhaps a micro-benchmark in stdlib would be better. That being said, even a microbenchmark won't tell us the effects on non-trivial programs. That's why it's challenging to do changes like this.
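A rough idea of what such a micro-benchmark could look like is sketched below. This is not the stdlib benchmark harness: the helper names (`via_macro`, `via_write_str`, `time_it`) are hypothetical, timing uses plain `std::time::Instant`, and the numbers it prints depend heavily on the toolchain and optimization level, so it should be built with `--release` and read with caution.

```rust
use std::fmt::Write;
use std::hint::black_box;
use std::time::Instant;

/// Appends the literal through the `write!` macro.
fn via_macro(out: &mut String, iters: usize) {
    for _ in 0..iters {
        write!(out, "ABCXYZ").unwrap();
    }
}

/// Appends the same literal through a direct `write_str` call.
fn via_write_str(out: &mut String, iters: usize) {
    for _ in 0..iters {
        out.write_str("ABCXYZ").unwrap();
    }
}

/// Runs `f` once, prints the elapsed wall-clock time, and returns the buffer
/// so callers can check that both variants produced the same output.
fn time_it(label: &str, f: fn(&mut String, usize), iters: usize) -> String {
    let mut out = String::with_capacity(iters * 6);
    let start = Instant::now();
    // `black_box` discourages the optimizer from eliding the writes.
    f(black_box(&mut out), iters);
    println!("{label}: {:?}", start.elapsed());
    out
}

fn main() {
    let a = time_it("write!", via_macro, 1_000_000);
    let b = time_it("write_str", via_write_str, 1_000_000);
    assert_eq!(a, b);
}
```

As noted in the comment above, a loop like this says little about non-trivial programs; it only isolates the cost of the call itself.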
@Kobzol are the assembly output changes a good indicator? See my comment above #121001 (comment)
It's definitely a way of evaluating the potential perf. effects of this change :) I'm not a libs reviewer, so I'll leave the evaluation of this change to others.
@bors r+
☀️ Test successful - checks-actions
Finished benchmarking commit (5a1e544): comparison URL.
Overall result: ❌✅ regressions and improvements - ACTION NEEDED
Next Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression
Instruction count: This is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage): This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Cycles: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Binary size: This is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Bootstrap: 645.309s -> 645.372s (0.01%)
Regressions look real, but are perhaps expected -- LLVM can optimize more with this change, so it'll spend more time doing that. I'm going to mark as triaged.
Are the binary size regressions expected?
@nnethercote thx for the ping, I had another go at in-depth assembly digging - it seems the inlined fn with
@nyurik: thanks for looking into it!
Optimize write with as_const_str for shorter code

Following up on rust-lang#121001

Apparently this code generates significant code block for each call to `write()` with non-simple formatting string - approx 100 lines of assembly code, possibly due to `dyn` (?). See generated assembly code [here](https://github.com/nyurik/rust-optimize-format-str/compare/before-changes..with-my-change#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477):

<details><summary>Details</summary>
<p>

This is the inlining of `write!(buffer, "Iteration {value} was written")`

```asm
core::fmt::Write::write_fmt:
        // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 194
        fn write_fmt(&mut self, args: Arguments<'_>) -> Result {
    push r15
    push r14
    push r13
    push r12
    push rbx
    mov rdx, rsi
        // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 427
        match (self.pieces, self.args) {
    mov rcx, qword ptr [rsi + 8]
    mov rax, qword ptr [rsi + 24]
        // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 428
        ([], []) => Some(""),
    cmp rcx, 1
    je .LBB0_8
    test rcx, rcx
    jne .LBB0_9
    test rax, rax
    jne .LBB0_9
        // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
        self.buf.reserve(self.len, additional);
    lea r12, [rdi + 16]
    lea rsi, [rip + .L__unnamed_2]
    xor ebx, ebx
.LBB0_6:
    mov r14, qword ptr [r12]
    jmp .LBB0_7
.LBB0_8:
        // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
        ([s], []) => Some(s),
    test rax, rax
    je .LBB0_4
.LBB0_9:
        // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 1108
        if let Some(s) = args.as_str() { output.write_str(s) } else { write_internal(output, args) }
    lea rsi, [rip + .L__unnamed_1]
    pop rbx
    pop r12
    pop r13
    pop r14
    pop r15
    jmp qword ptr [rip + core::fmt::write_internal@GOTPCREL]
.LBB0_4:
    mov rax, qword ptr [rdx]
        // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 429
        ([s], []) => Some(s),
    mov rsi, qword ptr [rax]
    mov rbx, qword ptr [rax + 8]
        // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 248
        if T::IS_ZST { usize::MAX } else { self.cap.0 }
    mov rax, qword ptr [rdi]
        // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
        self.buf.reserve(self.len, additional);
    mov r14, qword ptr [rdi + 16]
        // /home/nyurik/dev/rust/rust/library/core/src/num/mod.rs : 1281
        uint_impl! {
    sub rax, r14
        // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 392
        additional > self.capacity().wrapping_sub(len)
    cmp rax, rbx
        // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 309
        if self.needs_to_grow(len, additional) {
    jb .LBB0_5
.LBB0_7:
    mov rax, qword ptr [rdi + 8]
        // /home/nyurik/dev/rust/rust/library/core/src/ptr/mut_ptr.rs : 1046
        unsafe { intrinsics::offset(self, count) }
    add rax, r14
    mov r15, rdi
        // /home/nyurik/dev/rust/rust/library/core/src/intrinsics.rs : 2922
        copy_nonoverlapping(src, dst, count)
    mov rdi, rax
    mov rdx, rbx
    call qword ptr [rip + memcpy@GOTPCREL]
        // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 2040
        self.len += count;
    add r14, rbx
    mov qword ptr [r15 + 16], r14
        // /home/nyurik/dev/rust/rust/library/core/src/fmt/mod.rs : 216
        }
    xor eax, eax
    pop rbx
    pop r12
    pop r13
    pop r14
    pop r15
    ret
.LBB0_5:
        // /home/nyurik/dev/rust/rust/library/alloc/src/vec/mod.rs : 911
        self.buf.reserve(self.len, additional);
    lea r12, [rdi + 16]
    mov r15, rdi
    mov r13, rsi
        // /home/nyurik/dev/rust/rust/library/alloc/src/raw_vec.rs : 310
        do_reserve_and_handle(self, len, additional);
    mov rsi, r14
    mov rdx, rbx
    call alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle
    mov rsi, r13
    mov rdi, r15
    jmp .LBB0_6
```

</p>
</details>

```rust
#[inline]
pub fn write(output: &mut dyn Write, args: Arguments<'_>) -> Result {
    if let Some(s) = args.as_str() {
        output.write_str(s)
    } else {
        write_internal(output, args)
    }
}
```

So, this brings back the older experiment - where I used `if core::intrinsics::is_val_statically_known(s.is_some()) { s } else { None }` helper function, and called it in multiple places that used `write`. This is not as optimal because now every user of `write` must do this logic, but at least it results in significantly smaller assembly code for the formatting case, and results in identical code as now for the "simple" (no formatting) case. See [assembly comparison](https://github.com/nyurik/rust-optimize-format-str/compare/with-my-change..with-as-const-str#diff-6b404e954c692d8cdc8c452d819a216aa5dcf40522b5944639e9ad947279a477) of what is now with what this change brings (focus only on `fmt/intel-lib.txt` and `str/intel-lib.txt` files).

```rust
if let Some(s) = args.as_const_str() {
    self.write_str(s)
} else {
    write(self, args)
}
```
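To make the "every user of `write` must do this logic" point concrete, here is a sketch of a writer that performs the check itself by overriding `fmt::Write::write_fmt`. It runs on stable Rust under a loud assumption: `as_const_str_stand_in` is a hypothetical stand-in that simply forwards to `Arguments::as_str`, because the `is_val_statically_known` intrinsic mentioned above is unstable. The `Sink` type and its counters are likewise invented for illustration.

```rust
use std::fmt::{self, Arguments, Write};

/// Hypothetical stand-in for the `as_const_str` helper discussed above.
/// Assumption: stable Rust cannot ask whether a value is statically known,
/// so this version just forwards to the stable `as_str`.
fn as_const_str_stand_in(args: &Arguments<'_>) -> Option<&'static str> {
    args.as_str()
}

/// Hypothetical writer that records which path each `write!` call took.
#[derive(Default)]
struct Sink {
    buf: String,
    direct: usize,
    formatted: usize,
}

impl Write for Sink {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        self.buf.push_str(s);
        Ok(())
    }

    // The check lives in the implementer, not inside `fmt::write`.
    fn write_fmt(&mut self, args: Arguments<'_>) -> fmt::Result {
        if let Some(s) = as_const_str_stand_in(&args) {
            self.direct += 1;
            self.write_str(s)
        } else {
            self.formatted += 1;
            fmt::write(self, args)
        }
    }
}

fn main() {
    let mut sink = Sink::default();
    write!(sink, "plain").unwrap();
    let n = sink.buf.len();
    write!(sink, " n={n}").unwrap();
    println!("{} (direct: {}, formatted: {})", sink.buf, sink.direct, sink.formatted);
}
```

With the real intrinsic the branch would be resolved at compile time, so the formatting case would carry no extra check; this stand-in still evaluates it at run time.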
In case the format string has no arguments, simplify its implementation with a direct call to `output.write_str(value)`. This builds on @dtolnay's original suggestion. This does not change any expectations because the original `fn write()` implementation calls `write_str` for parts of the format string.
`fmt::Arguments::as_str` in the `write!` macro #100700
CC: @m-ou-se as probably the biggest expert in everything `format!`